Take-home Exercise 3

To explain factors affecting the resale prices of public housing in Singapore by building hedonic pricing models using appropriate GWR methods.

Wong Wei Ling www.google.com
11-06-2021

1 Objectives

To build hedonic pricing models to explain factors affecting the resale prices of public housing in Singapore. The hedonic price models must be built by using appropriate GWR methods. This study focuses on four-room flat and transaction period from 1st January 2019 to 30th September 2020.

2 Dataset

3 Install and Load Packages

packages = c('tidyverse', 'sf', 'spdep', 'tmap', 'ggpubr',
             'corrplot', 'units', 'olsrr', 'plyr',
             'matrixStats', 'GWmodel', 'httr')

for (p in packages){
if(!require(p, character.only = T)){
  install.packages(p)
}
  library(p,character.only = T)
}

Explanation on the uses of each package:

4 Data Import and Preparation for Geospatial Dataset

In this section, we have to import and ensure that the geospatial data is in a format that we can use to perform analysis. We have to check if it is being assigned the right CRS and code. Also, since we are working in the boundary of Singapore, we have to ensure that it is in SVY21 with the EPSG code of 3414 being assigned as well.

The datasets that we will prepare in this section are:

4.1 Import geospatial data set

mpsz_sf <- st_read(dsn = "data/geospatial", layer = "MP14_SUBZONE_WEB_PL")
Reading layer `MP14_SUBZONE_WEB_PL' from data source 
  `C:\wlwong2018\IS415 Blog\_posts\2021-11-06-take-home-exercise-3\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 323 features and 15 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
Projected CRS: SVY21
mrt_sf <- st_read(dsn = "data/geospatial", layer = "MRTLRTStnPtt")
Reading layer `MRTLRTStnPtt' from data source 
  `C:\wlwong2018\IS415 Blog\_posts\2021-11-06-take-home-exercise-3\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 185 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 6138.311 ymin: 27555.06 xmax: 45254.86 ymax: 47854.2
Projected CRS: SVY21
bus_sf <- st_read(dsn = "data/geospatial", layer = "BusStop")
Reading layer `BusStop' from data source 
  `C:\wlwong2018\IS415 Blog\_posts\2021-11-06-take-home-exercise-3\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 5156 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 4427.938 ymin: 26482.1 xmax: 48282.5 ymax: 52983.82
Projected CRS: SVY21

4.2 Look into datasets

glimpse(mpsz_sf)
Rows: 323
Columns: 16
$ OBJECTID   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15~
$ SUBZONE_NO <int> 1, 1, 3, 8, 3, 7, 9, 2, 13, 7, 12, 6, 1, 5, 1, 1,~
$ SUBZONE_N  <chr> "MARINA SOUTH", "PEARL'S HILL", "BOAT QUAY", "HEN~
$ SUBZONE_C  <chr> "MSSZ01", "OTSZ01", "SRSZ03", "BMSZ08", "BMSZ03",~
$ CA_IND     <chr> "Y", "Y", "Y", "N", "N", "N", "N", "Y", "N", "N",~
$ PLN_AREA_N <chr> "MARINA SOUTH", "OUTRAM", "SINGAPORE RIVER", "BUK~
$ PLN_AREA_C <chr> "MS", "OT", "SR", "BM", "BM", "BM", "BM", "SR", "~
$ REGION_N   <chr> "CENTRAL REGION", "CENTRAL REGION", "CENTRAL REGI~
$ REGION_C   <chr> "CR", "CR", "CR", "CR", "CR", "CR", "CR", "CR", "~
$ INC_CRC    <chr> "5ED7EB253F99252E", "8C7149B9EB32EEFC", "C35FEFF0~
$ FMEL_UPD_D <date> 2014-12-05, 2014-12-05, 2014-12-05, 2014-12-05, ~
$ X_ADDR     <dbl> 31595.84, 28679.06, 29654.96, 26782.83, 26201.96,~
$ Y_ADDR     <dbl> 29220.19, 29782.05, 29974.66, 29933.77, 30005.70,~
$ SHAPE_Leng <dbl> 5267.381, 3506.107, 1740.926, 3313.625, 2825.594,~
$ SHAPE_Area <dbl> 1630379.3, 559816.2, 160807.5, 595428.9, 387429.4~
$ geometry   <MULTIPOLYGON [m]> MULTIPOLYGON (((31495.56 30..., MULT~
glimpse(mrt_sf)
Rows: 185
Columns: 4
$ OBJECTID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ~
$ STN_NAME <chr> "EUNOS MRT STATION", "CHINESE GARDEN MRT STATION", ~
$ STN_NO   <chr> "EW7", "EW25", "NS14", "NS7", "EW18", "NS5", "EW28"~
$ geometry <POINT [m]> POINT (35782.96 33560.08), POINT (16790.75 36~
glimpse(bus_sf)
Rows: 5,156
Columns: 4
$ BUS_STOP_N <chr> "78221", "63359", "64141", "83139", "55231", "553~
$ BUS_ROOF_N <chr> "B06", "B01", "B13", "B07", "B02", "B03", "B10", ~
$ LOC_DESC   <chr> "BLK 231A CP", "HOUGANG SWIM CPLX", "AFT JLN TELA~
$ geometry   <POINT [m]> POINT (42227.96 39563.16), POINT (34065.75 ~

Results above show that:

4.3 Check for NA values

mpsz_sf[rowSums(is.na(mpsz_sf))!=0,]
Simple feature collection with 0 features and 15 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21
 [1] OBJECTID   SUBZONE_NO SUBZONE_N  SUBZONE_C  CA_IND     PLN_AREA_N
 [7] PLN_AREA_C REGION_N   REGION_C   INC_CRC    FMEL_UPD_D X_ADDR    
[13] Y_ADDR     SHAPE_Leng SHAPE_Area geometry  
<0 rows> (or 0-length row.names)
mrt_sf[rowSums(is.na(mrt_sf))!=0,]
Simple feature collection with 0 features and 3 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21
[1] OBJECTID STN_NAME STN_NO   geometry
<0 rows> (or 0-length row.names)
bus_sf[rowSums(is.na(bus_sf))!=0,]
Simple feature collection with 1 feature and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 22616.75 ymin: 47793.68 xmax: 22616.75 ymax: 47793.68
Projected CRS: SVY21
    BUS_STOP_N BUS_ROOF_N LOC_DESC                  geometry
264      47201        UNK     <NA> POINT (22616.75 47793.68)

Results above show that:

4.4 Remove NA values

bus_sf <- bus_sf[!rowSums(is.na(bus_sf))!=0,]

4.5 Check for NA values

mpsz_sf[rowSums(is.na(mpsz_sf))!=0,]
Simple feature collection with 0 features and 15 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21
 [1] OBJECTID   SUBZONE_NO SUBZONE_N  SUBZONE_C  CA_IND     PLN_AREA_N
 [7] PLN_AREA_C REGION_N   REGION_C   INC_CRC    FMEL_UPD_D X_ADDR    
[13] Y_ADDR     SHAPE_Leng SHAPE_Area geometry  
<0 rows> (or 0-length row.names)
mrt_sf[rowSums(is.na(mrt_sf))!=0,]
Simple feature collection with 0 features and 3 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21
[1] OBJECTID STN_NAME STN_NO   geometry
<0 rows> (or 0-length row.names)
bus_sf[rowSums(is.na(bus_sf))!=0,]
Simple feature collection with 0 features and 3 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21
[1] BUS_STOP_N BUS_ROOF_N LOC_DESC   geometry  
<0 rows> (or 0-length row.names)

Results above show that:

4.6 Check for duplicate values

mrt_sf[duplicated(mrt_sf$STN_NAME),]
Simple feature collection with 19 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 22949.03 ymin: 28735.78 xmax: 42362.71 ymax: 46418.56
Projected CRS: SVY21
First 10 features:
    OBJECTID                STN_NAME STN_NO                  geometry
33        33  ANG MO KIO MRT STATION   NS16 POINT (29807.27 39105.77)
97       105  MACPHERSON MRT STATION   DT26  POINT (34235.8 34256.43)
103      111       BUGIS MRT STATION   EW12 POINT (30491.56 31424.35)
114      122        EXPO MRT STATION   DT35 POINT (42362.71 35285.67)
116      124 BUONA VISTA MRT STATION   CC22 POINT (23251.76 32090.77)
121      129  MARINA BAY MRT STATION    CE2 POINT (30423.43 28735.78)
124      132   CHINATOWN MRT STATION   DT19 POINT (29238.97 29686.54)
134      142    TAMPINES MRT STATION   DT32 POINT (40213.03 37463.37)
140      148   SERANGOON MRT STATION   NE12 POINT (32480.07 36869.39)
144      152  PAYA LEBAR MRT STATION    CC9 POINT (34550.58 33300.34)
bus_sf[duplicated(bus_sf$BUS_STOP_N),]
Simple feature collection with 13 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 13488.02 ymin: 29969.22 xmax: 44144.57 ymax: 47806.75
Projected CRS: SVY21
First 10 features:
     BUS_STOP_N BUS_ROOF_N             LOC_DESC
350       58031        UNK      OPP CANBERRA DR
1569      51071        B21 MACRITCHIE RESERVOIR
2208      82221        B01               Blk 3A
2215      97079        B14  OPP ST. JOHN'S CRES
2392      62251        B03         BEF BLK 471B
2462      22501        B02             BLK 662A
2736      16119        NIL        TELETECH PARK
2976      58229       B01A          BLK 358A CP
3442      67421        NIL CHENG LIM STN EXIT B
3521      68091        B08         AFT BAKER ST
                      geometry
350   POINT (27089.69 47570.9)
1569 POINT (28305.37 36036.67)
2208 POINT (35308.74 33335.17)
2215 POINT (44144.57 38980.25)
2392 POINT (35500.36 39943.34)
2462 POINT (13488.02 35537.88)
2736 POINT (22318.89 29969.22)
2976 POINT (26094.86 47806.75)
3442 POINT (34741.77 42004.21)
3521 POINT (32038.84 43298.68)

Results above show that:

4.7 Remove duplicated values

mrt_sf <- mrt_sf[!duplicated(mrt_sf$STN_NAME),]
bus_sf <- bus_sf[!duplicated(bus_sf$BUS_STOP_N),]

4.8 Check for duplicated values

mrt_sf[duplicated(mrt_sf$STN_NAME),]
Simple feature collection with 0 features and 3 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21
[1] OBJECTID STN_NAME STN_NO   geometry
<0 rows> (or 0-length row.names)
bus_sf[duplicated(bus_sf$BUS_STOP_N),]
Simple feature collection with 0 features and 3 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21
[1] BUS_STOP_N BUS_ROOF_N LOC_DESC   geometry  
<0 rows> (or 0-length row.names)

Results above show that:

4.9 Check CRS

st_crs(mpsz_sf)
Coordinate Reference System:
  User input: SVY21 
  wkt:
PROJCRS["SVY21",
    BASEGEOGCRS["SVY21[WGS84]",
        DATUM["World Geodetic System 1984",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]],
            ID["EPSG",6326]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["Degree",0.0174532925199433]]],
    CONVERSION["unnamed",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]]]
st_crs(mrt_sf)
Coordinate Reference System:
  User input: SVY21 
  wkt:
PROJCRS["SVY21",
    BASEGEOGCRS["SVY21[WGS84]",
        DATUM["World Geodetic System 1984",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]],
            ID["EPSG",6326]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["Degree",0.0174532925199433]]],
    CONVERSION["unnamed",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]]]
st_crs(bus_sf)
Coordinate Reference System:
  User input: SVY21 
  wkt:
PROJCRS["SVY21",
    BASEGEOGCRS["WGS 84",
        DATUM["World Geodetic System 1984",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]],
            ID["EPSG",6326]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["Degree",0.0174532925199433]]],
    CONVERSION["unnamed",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]]]

From the results above:

4.10 Assign CRS and check again

mpsz_sf <- st_set_crs(mpsz_sf, 3414)
mrt_sf <- st_set_crs(mrt_sf, 3414)
bus_sf <- st_set_crs(bus_sf, 3414)

st_crs(mpsz_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
st_crs(mrt_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
st_crs(bus_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]

Results above show that:

4.11 Check for invalid geometries

length(which(st_is_valid(mpsz_sf) == FALSE))
[1] 9
length(which(st_is_valid(mrt_sf) == FALSE))
[1] 0
length(which(st_is_valid(bus_sf) == FALSE))
[1] 0

Results above show that:

4.12 Handle the invalid geometries and check again

mpsz_sf <- st_make_valid(mpsz_sf)

length(which(st_is_valid(mpsz_sf) == FALSE))
[1] 0
length(which(st_is_valid(mrt_sf) == FALSE))
[1] 0
length(which(st_is_valid(bus_sf) == FALSE))
[1] 0

Results above show that:

4.13 Plot geospatial data

tmap_mode('view')
tm_shape(mpsz_sf) +
  tm_polygons() +
tm_shape(bus_sf) +
  tm_dots(alpha = 0.4,
          col = "blue",
          size = 0.03)
tmap_mode('plot')

Results above show that:

4.14 Remove bus stops outside of Singapore boundary

4.14.1 Inspect names of bus stop

remove_busstop <- list(46211, 46219, 46239, 46609, 47701)
bus_sf[bus_sf$BUS_STOP_N %in% remove_busstop, ]
Simple feature collection with 5 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 17969.14 ymin: 49173.42 xmax: 20723.24 ymax: 52983.82
Projected CRS: SVY21 / Singapore TM
     BUS_STOP_N BUS_ROOF_N            LOC_DESC
765       47701        NIL          JB SENTRAL
1532      46239         NA          LARKIN TER
2257      46609         NA     KOTARAYA II TER
2269      46211        NIL JOHOR BAHRU CHECKPT
4346      46219        NIL JOHOR BAHRU CHECKPT
                      geometry
765  POINT (20370.54 49173.42)
1532 POINT (17969.14 52983.82)
2257  POINT (20025.46 49261.6)
2269 POINT (20623.18 49199.99)
4346 POINT (20723.24 49200.05)

Results above show that:

4.14.2 Remove bus stops from bus stop data

bus_sf <- bus_sf[!bus_sf$BUS_STOP_N %in% remove_busstop, ]

4.14.3 Check if bus stops are removed

bus_sf[bus_sf$BUS_STOP_N %in% remove_busstop, ]
Simple feature collection with 0 features and 3 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21 / Singapore TM
[1] BUS_STOP_N BUS_ROOF_N LOC_DESC   geometry  
<0 rows> (or 0-length row.names)

Results above show that:

5 Data Import and Preparation of Aspatial Dataset

In this section, we will prepare the aspatial data sets which will be our independent variables for calibrating the hedonic pricing models in the later section.

Note: This section will not be ran to save computational time. Documentation on the *results are still written even though the output is not shown.

The datasets that we will prepare in this section are:

5.1 Import aspatial data sets

resale <- read_csv("data/aspatial/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv")
eldercare <- read_csv("data/aspatial/eldercare.csv")
hawker <- read_csv("data/aspatial/hawkercentre.csv")
park <- read_csv("data/aspatial/park.csv")
prisch <- read_csv("data/aspatial/general-information-of-schools.csv")
mall <- read_csv("data/aspatial/mall.csv")
supermarket <- read_csv("data/aspatial/supermarkets.csv")
kindergarten <- read_csv("data/aspatial/kindergarten.csv")
childcare <- read_csv("data/aspatial/childcare.csv")

5.2 Look into datasets

glimpse(resale)
glimpse(eldercare)
glimpse(hawker)
glimpse(park)
glimpse(prisch)
glimpse(mall)
glimpse(supermarket)
glimpse(kindergarten)
glimpse(childcare)

Results above show that:

5.3 Extract and filter necessary columns and data

resale <- resale %>%
  filter(flat_type == "4 ROOM",
         month >= "2019-01" & month <= "2020-09")

eldercare <- eldercare %>%
  select(NAME, Lng, Lat)

hawker <- hawker %>%
  select(NAME, Lng, Lat)

park <- park %>%
  select(NAME, Lng, Lat)

prisch <- prisch %>%
  filter(mainlevel_code == "PRIMARY") %>%
  select(school_name, address, gifted_ind)
  
supermarket <- supermarket %>%
  select(NAME, Lng, Lat)

kindergarten <- kindergarten %>%
  select(NAME, Lng, Lat)

childcare <- childcare %>%
  select(NAME, Lng, Lat)

5.4 Look into datasets again

glimpse(resale)
glimpse(eldercare)
glimpse(hawker)
glimpse(park)
glimpse(prisch)
glimpse(mall)
glimpse(supermarket)
glimpse(kindergarten)
glimpse(childcare)
# glimpse(bus)

Results above show that:

5.5 Check for NA values

resale[rowSums(is.na(resale))!=0,]
eldercare[rowSums(is.na(eldercare))!=0,]
hawker[rowSums(is.na(hawker))!=0,]
park[rowSums(is.na(park))!=0,]
prisch[rowSums(is.na(prisch))!=0,]
mall[rowSums(is.na(mall))!=0,]
supermarket[rowSums(is.na(supermarket))!=0,]
kindergarten[rowSums(is.na(kindergarten))!=0,]
childcare[rowSums(is.na(childcare))!=0,]

Results above show that:

5.6 Prepare resale Dataset

resale dataset has a section on its own as there are different preparation steps to be done:

5.6.1 Rename value in column

resale$street_name <- gsub("ST\\.", "SAINT", resale$street_name)

5.6.2 Combine columns for geo-coding

resale <- resale %>%
  mutate(`address` = paste(`block`, `street_name`, sep=' '))

5.6.3 Onemap API for geo-coding

5.6.3.1 Function to geo code

This function does the following:

library(httr)

geo_code <- function(address) {
  url <- "https://developers.onemap.sg/commonapi/search"
  query <- list("searchVal" = address, "returnGeom" = "Y", "getAddrDetails" = "N") #, "pageNum" = "1"
  res <- GET(url, query = query, verbose())

  output <- content(res) %>%
    as.data.frame() %>%
    select(results.X, results.Y, results.LONGITUDE, results.LATITUDE)

  return(output)
}

5.6.3.2 For loop to apply function (row 1 to 1000)

This code chunk:

resale$X <- 0
resale$Y <- 0
resale$LONGITUDE <- 0 
resale$LATITUDE <- 0 

for (i in 1:1000) {
  output <- geo_code(resale$address[i])

  resale$X[i] <- output$results.X
  resale$Y[i] <- output$results.Y
  resale$LONGITUDE[i] <- output$results.LONGITUDE
  resale$LATITUDE[i] <- output$results.LATITUDE
}

5.6.3.3 For loop to apply function (row 1001 and beyond)

for (i in 15001:15901) {
  output <- geo_code(resale$address[i])
  
  if (resale$X[i] == 0) {
    resale$X[i] <- output$results.X
    resale$Y[i] <- output$results.Y
    resale$LONGITUDE[i] <- output$results.LONGITUDE
    resale$LATITUDE[i] <- output$results.LATITUDE
  }
}

5.6.3.4 Write into .csv file

write_csv(resale, "data/aspatial/final_resale_prices_XY_LongLat.csv")

5.6.4 Import .csv file

resale_new <- read_csv("data/aspatial/final_resale_prices_XY_LongLat.csv")
glimpse(resale_new)

Results above show that:

5.6.5 Extract necessary columns

resale_new <- resale_new %>%
  select(month, town, flat_type, storey_range, floor_area_sqm, remaining_lease, resale_price, address, LONGITUDE, LATITUDE)

5.6.6 Manipulate storey_range column

resale_new <- resale_new %>%
  mutate(yesno = 1) %>%
  distinct %>%
  pivot_wider(names_from = storey_range, values_from = yesno, values_fill = 0)

5.6.7 Manipulate remaining_lease column

resale_new$remaining_lease_new <- gsub("years", "", paste(resale_new$remaining_lease))
resale_new$remaining_lease_new <- gsub("month", "", paste(resale_new$remaining_lease_new))
resale_new$remaining_lease_new <- gsub("s", "", paste(resale_new$remaining_lease_new))

resale_new$remaining_lease_yr <- substr(resale_new$remaining_lease_new, start = 1, stop = 2)
resale_new$remaining_lease_mth <- substring(resale_new$remaining_lease_new, 5, 6)

resale_new$remaining_lease_yr <- as.double(resale_new$remaining_lease_yr)
resale_new$remaining_lease_mth <- as.double(resale_new$remaining_lease_mth)


resale_new$remaining_lease_mth[is.na(resale_new$remaining_lease_mth)] <- 0

resale_new <- resale_new %>%
  mutate(`remaining_lease_final` = `remaining_lease_yr` + round((`remaining_lease_mth`/12),2))

drop <- c("remaining_lease_new", "remaining_lease_yr", "remaining_lease_mth")
resale_new <- resale_new[, !names(resale_new) %in% drop]

5.7 Prepare prisch (primary school) Dataset

prisch dataset has a section on its own as there are different preparation steps to be done:

5.7.1 Apply geo_code function

This code chunk:

prisch$LONGITUDE <- 0 
prisch$LATITUDE <- 0 

count <- 0
failed_count <- 0

for (i in 1:183){
  tryCatch( {
    output <- geo_code(prisch$address[i])
    count <- count + 1
    prisch$LONGITUDE[i] <- output$results.LONGITUDE
    prisch$LATITUDE[i] <- output$results.LATITUDE

  }, error = function(err) {
      count <- count + 1
      failed_count <- failed_count + 1
      cat('Failed to extract',count,'out of',length(prisch$address),'addresses')
    }
  )
}

5.7.2 Write output into a CSV file

write_csv(prisch, "data/aspatial/prisch.csv")

5.7.3 Import prisch Dataset

prisch <- read_csv("data/aspatial/prisch.csv")
glimpse(prisch)

Results above show that:

5.7.4 Verify that all primary schools have Latitude & Longitude

5.7.4.1 Find primary schools with missing Latitude & Longitude

prisch$school_name[prisch$LONGITUDE==0.0000]

Results above show that:

5.7.4.2 Assign Latitude & Longitude

prisch$LONGITUDE[prisch$school_name == "HOUGANG PRIMARY SCHOOL"] <- 103.881072
prisch$LATITUDE[prisch$school_name == "HOUGANG PRIMARY SCHOOL"] <- 1.3783581

prisch$LONGITUDE[prisch$school_name == "JING SHAN PRIMARY SCHOOL"] <- 103.8520152
prisch$LATITUDE[prisch$school_name == "JING SHAN PRIMARY SCHOOL"] <- 1.3722578

prisch$LONGITUDE[prisch$school_name == "WEST GROVE PRIMARY SCHOOL"] <- 103.6989833
prisch$LATITUDE[prisch$school_name == "WEST GROVE PRIMARY SCHOOL"] <- 1.3446809

prisch$LONGITUDE[prisch$school_name == "HWHITE SANDS PRIMARY SCHOOL"] <- 103.9612546
prisch$LATITUDE[prisch$school_name == "WHITE SANDS PRIMARY SCHOOL"] <- 1.365473

5.7.5 Filter good primary school

gd_prisch <- prisch %>%
  filter(gifted_ind == "Yes")

Results above show that:

6 Geospatial Wrangling

Note: This section will not be ran except for Section 6.5 and Section 6.6 to save computational time. Documentation on the results are still written even though the output is not shown.

The variables that we will prepare in this section are:

Locational factors

We will then combine the above variables with resale dataset which consists of the prepared Structural factors such as:

6.1 Convert Aspatial Dataframe into a sf object

6.1.1 Function to convert aspatial dataframe into a sf object

This function does the following:

convert_sf <- function(df, x, y) {
  st_as_sf(df,
            coords = c(x, y),
            crs=4326) %>%
  st_transform(crs=3414)
}

6.1.2 Apply function to convert aspatial dataframes into a sf object

resale_sf <- convert_sf(resale_new, "LONGITUDE", "LATITUDE")
eldercare_sf <- convert_sf(eldercare, "Lng", "Lat")
hawker_sf <- convert_sf(hawker, "Lng", "Lat")
park_sf <- convert_sf(park, "Lng", "Lat")
gd_prisch_sf <- convert_sf(gd_prisch, "LONGITUDE", "LATITUDE")
mall_sf <- convert_sf(mall, "longitude", "latitude")
supermarket_sf <- convert_sf(supermarket, "Lng", "Lat")
kindergarten_sf <- convert_sf(kindergarten, "Lng", "Lat")
childcare_sf <- convert_sf(childcare, "Lng", "Lat")
prisch_sf <- convert_sf(prisch, "LONGITUDE", "LATITUDE")

6.2 Prepare Independent Variables

6.2.1 Central Business District (CBD) of Singapore

longitude <- 103.851784
latitude <- 1.287953

cbd_coorinates_sf <- data.frame(longitude, latitude) %>%
  st_as_sf(., coords = c("longitude", "latitude"), crs=4326) %>%
  st_transform(crs=3414)

6.2.2 Function to calculate proximity of variables

This function does the following:

calulate_prox <- function(df1, df2, var) {
  distance_matrix <- st_distance(df1, df2) %>%
    drop_units()
  df1[,var] <- rowMins(distance_matrix)

  return(df1)
}

6.2.3 Apply function to calculate proximity

resale_sf <- calulate_prox(resale_sf, cbd_coorinates_sf, "cbd_prox") %>%
              calulate_prox(., eldercare_sf, "eldercare_prox") %>%
              calulate_prox(., hawker_sf, "hawker_prox") %>% 
              calulate_prox(., mrt_sf, "mrt_prox") %>%
              calulate_prox(., park_sf, "park_prox") %>%
              calulate_prox(., gd_prisch_sf, "gd_prisch_prox") %>%
              calulate_prox(., mall_sf, "mall_prox") %>%
              calulate_prox(., supermarket_sf, "supermarket_prox") 

6.2.4 Function to calculate number of variables within a certain distance

This function does the following:

calculate_num <- function(df1, df2, distance, var) {
  distance_matrix <- st_distance(df1, df2) %>% 
    drop_units()
  distance_matrix <- data.frame(distance_matrix)
  df1[,var] <- rowSums(distance_matrix <= distance)
  
  return(df1)
}

6.2.5 Apply function to calculate number of variables within a certain distance

resale_sf <- calculate_num(resale_sf, kindergarten_sf, 350, "kindergarten_num") %>%
              calculate_num(., childcare_sf, 350, "childcare_num") %>%
              calculate_num(., bus_sf, 350, "busstop_num") %>%
              calculate_num(., prisch_sf, 1000, "prisch_num")

6.3 Re-arrange columns in resale_sf

col_order <- c("month", "town", "flat_type", "remaining_lease", "address",
              "resale_price", "floor_area_sqm", "remaining_lease_final", 
              "01 TO 03", "04 TO 06", "07 TO 09", "10 TO 12", "13 TO 15", "16 TO 18",
              "19 TO 21", "22 TO 24", "25 TO 27", "28 TO 30", "31 TO 33", "34 TO 36",
              "37 TO 39", "40 TO 42", "43 TO 45", "46 TO 48", "49 TO 51",
              'cbd_prox', "eldercare_prox", "hawker_prox", "mrt_prox", "park_prox",
               "gd_prisch_prox", "mall_prox", "supermarket_prox",
              "kindergarten_num", "childcare_num", "busstop_num", "prisch_num", "geometry")
resale_sf <- resale_sf[, col_order]

6.4 Write to a shp file

st_write(resale_sf, "data/geospatial/final_resale.shp")

6.5 Import shp file

resale_sf <- st_read(dsn="data/geospatial", layer="final_resale")
Reading layer `final_resale' from data source 
  `C:\wlwong2018\IS415 Blog\_posts\2021-11-06-take-home-exercise-3\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 15853 features and 37 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 11597.31 ymin: 28217.39 xmax: 42623.63 ymax: 48741.06
Projected CRS: SVY21 / Singapore TM

6.6 Rename columns

resale_sf <- resale_sf %>%
  dplyr::rename(resale_price = rsl_prc, floor_area_sqm = flr_r_s, remaining_lease_final = rmnng__,
         lvl_01_TO_03 = X01TO03, lvl_04_TO_06 = X04TO06, lvl_07_TO_09 = X07TO09,
         lvl_10_TO_12 = X10TO12, lvl_13_TO_15 = X13TO15, lvl_16_TO_18 = X16TO18,
         lvl_19_TO_21 = X19TO21, lvl_22_TO_24 = X22TO24, lvl_25_TO_27 = X25TO27,
         lvl_28_TO_30 = X28TO30, lvl_31_TO_33 = X31TO33, lvl_34_TO_36 = X34TO36,
         lvl_37_TO_39 = X37TO39, lvl_40_TO_42 = X40TO42, lvl_43_TO_45 = X43TO45,
         lvl_46_TO_48 = X46TO48, lvl_49_TO_51 = X49TO51,
         cbd_prox = cbd_prx, eldercare_prox = eldrcr_, hawker_prox = hwkr_pr,
         mrt_prox = mrt_prx, park_prox = prk_prx, gd_prisch_prox = gd_prs_,
         mall_prox = mll_prx, supermarket_prox = sprmrk_, kindergarten_num  = kndrgr_,
         childcare_num  = chldcr_, busstop_num  = bsstp_n, prisch_num  = prsch_n)

7 Exploratory Data Analysis

In this section, we will perform some Exploratory Data Analysis to understand our dataset.

7.1 EDA using statistical graphics

7.1.1 Function to Plot Histogram

mul_hist <- function(df, vnam, x_axis) {
  ggplot(data=df, aes(x=vnam)) +
  geom_histogram(bins=20, color="black", fill="salmon") +
        scale_x_continuous(name = x_axis)
}

7.1.2 Plot distribution of resale_price

mul_hist(resale_sf, resale_sf$resale_price, "Resale Price")

Results above show that:

7.1.3 Normalise using Log Transformation

resale_sf <- resale_sf %>%
  mutate(`resale_price_log` = log(resale_price))

7.1.4 Plot resale_price_log

mul_hist(resale_sf, resale_sf$resale_price_log, "Resale Price Log")

Results above show that:

7.2 Multiple Plots distribution of variables

7.2.1 Continuous variables - Histogram

ggarrange(mul_hist(resale_sf, resale_sf$floor_area_sqm, "Floor Area Sqm"),
          mul_hist(resale_sf, resale_sf$remaining_lease_final, "Remaining Lease"),
          mul_hist(resale_sf, resale_sf$cbd_prox, "Proximity to CBD"),
          mul_hist(resale_sf, resale_sf$eldercare_prox, "Proximity to Eldercare"),
          mul_hist(resale_sf, resale_sf$hawker_prox, "Proximity to Hawker Centre"),
          mul_hist(resale_sf, resale_sf$mrt_prox, "Proximity to Mrt"),
          mul_hist(resale_sf, resale_sf$park_prox, "Proximity to Park"),
          mul_hist(resale_sf, resale_sf$gd_prisch_prox, "Proximity to Good Pri. School"),
          mul_hist(resale_sf, resale_sf$mall_prox, "Proximity to Shopping Mall"),
          mul_hist(resale_sf, resale_sf$supermarket_prox, "Proximity to Supermarket"),
    ncol = 2, nrow = 5)

Results above show that:

7.2.2 Discrete variables - Bar Chart

7.2.2.1 Function to plot bar charts

mul_bar <- function(df, vnam, x_axis){
  ggplot(data=df, aes(x=factor(vnam))) +
    geom_bar(color="black", fill="darkseagreen3")   +
         scale_x_discrete(name = x_axis)
}

7.2.2.2 Apply function to plot bar charts

ggarrange(mul_bar(resale_sf, resale_sf$kindergarten_num, "No. of Kindergartens (within 350m)"),
          mul_bar(resale_sf, resale_sf$childcare_num, "No. of Childcare Centres (within 350m)"),
          mul_bar(resale_sf, resale_sf$busstop_num, "No. of Bus Stops (within 350m)"),
          mul_bar(resale_sf, resale_sf$prisch_num, "No. of Pri. School (within 1km)"),
      ncol = 2, nrow = 2)

Results above show that:

7.2.3 Binary variables - Bar Chart

7.2.3.1 Function to plot bar charts

mul_bar_level <- function(df, vnam, x_axis){
  ggplot(data=df, aes(x=factor(vnam))) +
    geom_bar(color="black", fill="steelblue3") +  
         scale_x_discrete(name = x_axis) + coord_flip()
}

7.2.3.2 Appply function to plot bar charts

ggarrange(mul_bar_level(resale_sf, resale_sf$lvl_01_TO_03, "Level 1 to 3"),
          mul_bar_level(resale_sf, resale_sf$lvl_04_TO_06, "Level 4 to 6"),
          mul_bar_level(resale_sf, resale_sf$lvl_07_TO_09, "Level 7 to 9"),
          mul_bar_level(resale_sf, resale_sf$lvl_10_TO_12, "Level 10 to 12"),
          mul_bar_level(resale_sf, resale_sf$lvl_13_TO_15, "Level 13 to 15"),
          mul_bar_level(resale_sf, resale_sf$lvl_16_TO_18, "Level 16 to 18"),
          mul_bar_level(resale_sf, resale_sf$lvl_19_TO_21, "Level 19 to 21"),
          mul_bar_level(resale_sf, resale_sf$lvl_22_TO_24, "Level 22 to 24"),
          mul_bar_level(resale_sf, resale_sf$lvl_25_TO_27, "Level 25 to 27"),
          mul_bar_level(resale_sf, resale_sf$lvl_28_TO_30, "Level 28 to 30"),
          mul_bar_level(resale_sf, resale_sf$lvl_31_TO_33, "Level 31 to 33"),
          mul_bar_level(resale_sf, resale_sf$lvl_34_TO_36, "Level 34 to 36"),
          mul_bar_level(resale_sf, resale_sf$lvl_37_TO_39, "Level 37 to 39"),
          mul_bar_level(resale_sf, resale_sf$lvl_40_TO_42, "Level 40 to 42"),
          mul_bar_level(resale_sf, resale_sf$lvl_43_TO_45, "Level 43 to 45"),
          mul_bar_level(resale_sf, resale_sf$lvl_46_TO_48, "Level 46 to 48"),
          mul_bar_level(resale_sf, resale_sf$lvl_49_TO_51, "Level 49 to 51"),
      ncol = 2, nrow = 9)

Results above show that:

7.3 Drawing Statistical Point Map

tmap_options(check.and.fix = TRUE)

tmap_mode("view")

tm_shape(mpsz_sf)+
  tm_polygons() +
tm_shape(resale_sf) +
  tm_dots(col = "resale_price",
          alpha = 0.6,
          style="quantile") +
  tm_view(set.zoom.limits = c(11,14))
tmap_mode("plot")

Results above show that:

8 Hedonic Pricing Modelling in R

In this section, we will build hedonic pricing models for HDB resale units using lm() of Base R.

8.1 Multiple Linear Regression Method

8.1.1 Visualise relationships of independent variables

Before we start calibrating the Hedonic Pricing Model, we have to ensure that the independent variables used are not highly correlated to each other. This is called multicollinearity in statistics.

8.1.1.1 Drop geometry column

resale_sf_nogeom <- resale_sf %>%
  st_drop_geometry()

8.1.1.2 Plot correlation plot

corrplot(cor(resale_sf_nogeom[, 7:37]), diag = FALSE, order = "AOE",
         tl.pos = "td", tl.cex = 0.5, method = "number", type = "upper")

Results above show that:

8.1.2 Build hedonic pricing model using multiple linear regression method

resale.mlr <- lm(formula = resale_price ~ floor_area_sqm + remaining_lease_final + lvl_01_TO_03 + lvl_04_TO_06 + lvl_07_TO_09 + lvl_10_TO_12 + lvl_13_TO_15 + lvl_16_TO_18 + lvl_19_TO_21 + lvl_22_TO_24 + lvl_25_TO_27 + lvl_28_TO_30 + lvl_31_TO_33 + lvl_34_TO_36 + lvl_37_TO_39 + lvl_40_TO_42 + lvl_43_TO_45 + lvl_46_TO_48 + lvl_49_TO_51 + cbd_prox + eldercare_prox  + hawker_prox  + mrt_prox + park_prox  + gd_prisch_prox + mall_prox + supermarket_prox  + kindergarten_num + childcare_num + busstop_num + prisch_num, data=resale_sf)

summary(resale.mlr)

Call:
lm(formula = resale_price ~ floor_area_sqm + remaining_lease_final + 
    lvl_01_TO_03 + lvl_04_TO_06 + lvl_07_TO_09 + lvl_10_TO_12 + 
    lvl_13_TO_15 + lvl_16_TO_18 + lvl_19_TO_21 + lvl_22_TO_24 + 
    lvl_25_TO_27 + lvl_28_TO_30 + lvl_31_TO_33 + lvl_34_TO_36 + 
    lvl_37_TO_39 + lvl_40_TO_42 + lvl_43_TO_45 + lvl_46_TO_48 + 
    lvl_49_TO_51 + cbd_prox + eldercare_prox + hawker_prox + 
    mrt_prox + park_prox + gd_prisch_prox + mall_prox + supermarket_prox + 
    kindergarten_num + childcare_num + busstop_num + prisch_num, 
    data = resale_sf)

Residuals:
    Min      1Q  Median      3Q     Max 
-189805  -38452   -1878   34805  453664 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            2.002e+05  1.333e+04  15.017  < 2e-16 ***
floor_area_sqm         2.798e+03  7.169e+01  39.027  < 2e-16 ***
remaining_lease_final  4.166e+03  4.431e+01  94.010  < 2e-16 ***
lvl_01_TO_03          -7.412e+04  1.042e+04  -7.114 1.17e-12 ***
lvl_04_TO_06          -5.389e+04  1.039e+04  -5.188 2.15e-07 ***
lvl_07_TO_09          -4.024e+04  1.037e+04  -3.879 0.000105 ***
lvl_10_TO_12          -3.264e+04  1.037e+04  -3.147 0.001653 ** 
lvl_13_TO_15          -2.783e+04  1.039e+04  -2.679 0.007401 ** 
lvl_16_TO_18          -2.223e+04  1.051e+04  -2.115 0.034459 *  
lvl_19_TO_21          -2.551e+02  1.089e+04  -0.023 0.981307    
lvl_22_TO_24          -1.769e+03  1.100e+04  -0.161 0.872313    
lvl_25_TO_27           3.888e+04  1.165e+04   3.338 0.000846 ***
lvl_28_TO_30           1.086e+05  1.203e+04   9.029  < 2e-16 ***
lvl_31_TO_33           1.260e+05  1.445e+04   8.717  < 2e-16 ***
lvl_34_TO_36           1.187e+05  1.513e+04   7.841 4.74e-15 ***
lvl_37_TO_39           1.797e+05  1.489e+04  12.070  < 2e-16 ***
lvl_40_TO_42           2.154e+05  1.652e+04  13.037  < 2e-16 ***
lvl_43_TO_45           2.540e+05  3.620e+04   7.017 2.37e-12 ***
lvl_46_TO_48           2.720e+05  3.622e+04   7.509 6.28e-14 ***
lvl_49_TO_51           3.385e+05  4.373e+04   7.741 1.05e-14 ***
cbd_prox              -1.772e+01  1.959e-01 -90.412  < 2e-16 ***
eldercare_prox        -1.580e+01  7.881e-01 -20.055  < 2e-16 ***
hawker_prox           -1.965e+01  1.020e+00 -19.265  < 2e-16 ***
mrt_prox              -3.220e+01  1.365e+00 -23.590  < 2e-16 ***
park_prox             -1.026e+01  1.207e+00  -8.503  < 2e-16 ***
gd_prisch_prox         2.260e+00  2.283e-01   9.901  < 2e-16 ***
mall_prox             -2.013e+01  1.472e+00 -13.676  < 2e-16 ***
supermarket_prox      -2.376e+01  3.293e+00  -7.216 5.60e-13 ***
kindergarten_num       7.348e+03  5.013e+02  14.657  < 2e-16 ***
childcare_num         -4.168e+03  2.778e+02 -15.004  < 2e-16 ***
busstop_num            8.111e+02  1.775e+02   4.570 4.92e-06 ***
prisch_num            -8.710e+03  3.893e+02 -22.373  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 59990 on 15821 degrees of freedom
Multiple R-squared:  0.7514,    Adjusted R-squared:  0.7509 
F-statistic:  1542 on 31 and 15821 DF,  p-value: < 2.2e-16

Results above show that:

ols_regress() to print revised model summary and parameter estimates.

resale.mlr1 <- lm(formula = resale_price ~ floor_area_sqm + remaining_lease_final + lvl_01_TO_03 + lvl_04_TO_06 + lvl_07_TO_09 + lvl_10_TO_12 + lvl_13_TO_15 + lvl_16_TO_18 + lvl_25_TO_27 + lvl_28_TO_30 + lvl_31_TO_33 + lvl_34_TO_36 + lvl_37_TO_39 + lvl_40_TO_42 + lvl_43_TO_45 + lvl_46_TO_48 + lvl_49_TO_51 + cbd_prox + eldercare_prox  + hawker_prox  + mrt_prox + park_prox  + gd_prisch_prox + mall_prox + supermarket_prox  + kindergarten_num + childcare_num + busstop_num + prisch_num, data=resale_sf)

ols_regress(resale.mlr1)
                            Model Summary                              
----------------------------------------------------------------------
R                       0.867       RMSE                    59985.065 
R-Squared               0.751       Coef. Var                  13.835 
Adj. R-Squared          0.751       MSE                3598208060.583 
Pred R-Squared          0.750       MAE                     46091.675 
----------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 

                                     ANOVA                                       
--------------------------------------------------------------------------------
                    Sum of                                                      
                   Squares           DF       Mean Square       F          Sig. 
--------------------------------------------------------------------------------
Regression    1.720769e+14           29      5.933685e+12    1649.067    0.0000 
Residual      5.693445e+13        15823    3598208060.583                       
Total         2.290113e+14        15852                                         
--------------------------------------------------------------------------------

                                              Parameter Estimates                                               
---------------------------------------------------------------------------------------------------------------
                model          Beta    Std. Error    Std. Beta       t        Sig          lower         upper 
---------------------------------------------------------------------------------------------------------------
          (Intercept)    199337.591      8764.453                  22.744    0.000    182158.264    216516.917 
       floor_area_sqm      2798.007        71.673        0.165     39.039    0.000      2657.520      2938.495 
remaining_lease_final      4165.887        44.265        0.445     94.113    0.000      4079.123      4252.651 
         lvl_01_TO_03    -73296.939      2946.201       -0.232    -24.878    0.000    -79071.827    -67522.050 
         lvl_04_TO_06    -53071.218      2882.078       -0.186    -18.414    0.000    -58720.418    -47422.017 
         lvl_07_TO_09    -39416.337      2884.798       -0.133    -13.663    0.000    -45070.869    -33761.806 
         lvl_10_TO_12    -31816.199      2901.167       -0.103    -10.967    0.000    -37502.817    -26129.580 
         lvl_13_TO_15    -27011.077      3045.041       -0.067     -8.871    0.000    -32979.704    -21042.450 
         lvl_16_TO_18    -21411.594      3367.917       -0.039     -6.358    0.000    -28013.096    -14810.093 
         lvl_25_TO_27     39697.132      6014.804        0.029      6.600    0.000     27907.432     51486.832 
         lvl_28_TO_30    109436.846      6938.411        0.068     15.773    0.000     95836.770    123036.922 
         lvl_31_TO_33    126797.840     10363.903        0.050     12.235    0.000    106483.409    147112.271 
         lvl_34_TO_36    119499.374     11294.187        0.043     10.581    0.000     97361.480    141637.268 
         lvl_37_TO_39    180558.879     10969.731        0.067     16.460    0.000    159056.956    202060.801 
         lvl_40_TO_42    216253.102     13089.795        0.067     16.521    0.000    190595.613    241910.591 
         lvl_43_TO_45    254868.128     34767.850        0.029      7.331    0.000    186719.181    323017.076 
         lvl_46_TO_48    272834.979     34788.857        0.031      7.843    0.000    204644.856    341025.101 
         lvl_49_TO_51    339310.275     42543.558        0.032      7.976    0.000    255920.055    422700.495 
             cbd_prox       -17.716         0.196       -0.611    -90.425    0.000       -18.100       -17.332 
       eldercare_prox       -15.804         0.788       -0.087    -20.055    0.000       -17.348       -14.259 
          hawker_prox       -19.647         1.020       -0.084    -19.267    0.000       -21.645       -17.648 
             mrt_prox       -32.200         1.365       -0.104    -23.590    0.000       -34.876       -29.525 
            park_prox       -10.266         1.207       -0.038     -8.504    0.000       -12.632        -7.899 
       gd_prisch_prox         2.261         0.228        0.058      9.903    0.000         1.813         2.708 
            mall_prox       -20.128         1.472       -0.064    -13.676    0.000       -23.013       -17.243 
     supermarket_prox       -23.756         3.293       -0.031     -7.215    0.000       -30.210       -17.302 
     kindergarten_num      7348.450       501.243        0.062     14.660    0.000      6365.957      8330.943 
        childcare_num     -4167.025       277.731       -0.068    -15.004    0.000     -4711.410     -3622.640 
          busstop_num       810.821       177.478        0.020      4.569    0.000       462.944      1158.698 
           prisch_num     -8707.578       389.089       -0.112    -22.379    0.000     -9470.237     -7944.920 
---------------------------------------------------------------------------------------------------------------

8.1.3 Test for Multicollinearity

To perform regression modelling on spatial data, we first have to ensure that the residuals are not correlated with each other.

ols_vif_tol(resale.mlr1)
               Variables Tolerance      VIF
1         floor_area_sqm 0.8749787 1.142885
2  remaining_lease_final 0.7029250 1.422627
3           lvl_01_TO_03 0.1808740 5.528711
4           lvl_04_TO_06 0.1539728 6.494654
5           lvl_07_TO_09 0.1657612 6.032773
6           lvl_10_TO_12 0.1785631 5.600261
7           lvl_13_TO_15 0.2743421 3.645084
8           lvl_16_TO_18 0.4161241 2.403129
9           lvl_25_TO_27 0.8149313 1.227097
10          lvl_28_TO_30 0.8540849 1.170844
11          lvl_31_TO_33 0.9326623 1.072199
12          lvl_34_TO_36 0.9420581 1.061506
13          lvl_37_TO_39 0.9363147 1.068017
14          lvl_40_TO_42 0.9558749 1.046162
15          lvl_43_TO_45 0.9924109 1.007647
16          lvl_46_TO_48 0.9912128 1.008865
17          lvl_49_TO_51 0.9941306 1.005904
18              cbd_prox 0.3440807 2.906295
19        eldercare_prox 0.8325543 1.201123
20           hawker_prox 0.8210584 1.217940
21              mrt_prox 0.8015666 1.247557
22             park_prox 0.7736125 1.292637
23        gd_prisch_prox 0.4615442 2.166640
24             mall_prox 0.7133221 1.401891
25      supermarket_prox 0.8461031 1.181889
26      kindergarten_num 0.8831753 1.132278
27         childcare_num 0.7568531 1.321260
28           busstop_num 0.8561472 1.168023
29            prisch_num 0.6316363 1.583190

Results above show that:

8.1.4 Test for Non-Linearity

ols_plot_resid_fit(resale.mlr1)

Results above show that:

8.1.5 Test for Normality Assumption

ols_plot_resid_hist(resale.mlr1)

Results above show that:

8.1.6 Test for Spatial Autocorrelation

The hedonic model is using geographically referenced attributes, hence it is also important for us to visualise the residual of the hedonic pricing model.

Since we also have to assume that the residuals are not distributed randomly, we need to combine resale_sf with resale.mlr1, then convert it into a SpatialPointsDataFrame to perform spatial autocorrelation test.

8.1.6.1 Export residual of hedonic pricing model

mlr.output <- as.data.frame(resale.mlr1$residuals)

8.1.6.2 Join with resale_sf

resale.res.sf <- cbind(resale_sf, 
                        resale.mlr1$residuals) %>%
  dplyr::rename("MLR_RES" = "resale.mlr1.residuals")

8.1.6.3 Convert to SpatialPointsDataFrame

resale_sp <- as_Spatial(resale.res.sf)
resale_sp
class       : SpatialPointsDataFrame 
features    : 15853 
extent      : 11597.31, 42623.63, 28217.39, 48741.06  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
variables   : 39
names       :   month,       town, flt_typ,            rmnng_l,          address, resale_price, floor_area_sqm, remaining_lease_final, lvl_01_TO_03, lvl_04_TO_06, lvl_07_TO_09, lvl_10_TO_12, lvl_13_TO_15, lvl_16_TO_18, lvl_19_TO_21, ... 
min values  : 2019-01, ANG MO KIO,  4 ROOM, 45 years 06 months,   1 CHAI CHEE RD,       218000,             74,                  45.5,            0,            0,            0,            0,            0,            0,            0, ... 
max values  : 2020-09,     YISHUN,  4 ROOM,           97 years, 9B BOON TIONG RD,      1186888,            138,                    97,            1,            1,            1,            1,            1,            1,            1, ... 

8.1.6.4 Display interactive point symbol map

tmap_mode("view")

tm_shape(mpsz_sf)+
  tm_polygons(alpha = 0.4) +
tm_shape(resale.res.sf) +  
  tm_dots(col = "MLR_RES",
          alpha = 0.6,
          style="quantile") +
  tm_view(set.zoom.limits = c(11,14))
tmap_mode("plot")

Results above show that:

8.1.6.5 Compute distance-based weight matrix

nb <- dnearneigh(coordinates(resale_sp), 0, 1500, longlat = FALSE)

8.1.6.6 Convert to spatial weights

nb_lw <- nb2listw(nb, style = 'W')

8.1.6.7 Perform Moran’s I Test for Residual Spatial Autocorrelation

lm.morantest(resale.mlr1, nb_lw)

    Global Moran I for regression residuals

data:  
model: lm(formula = resale_price ~ floor_area_sqm +
remaining_lease_final + lvl_01_TO_03 + lvl_04_TO_06 +
lvl_07_TO_09 + lvl_10_TO_12 + lvl_13_TO_15 + lvl_16_TO_18 +
lvl_25_TO_27 + lvl_28_TO_30 + lvl_31_TO_33 + lvl_34_TO_36 +
lvl_37_TO_39 + lvl_40_TO_42 + lvl_43_TO_45 + lvl_46_TO_48 +
lvl_49_TO_51 + cbd_prox + eldercare_prox + hawker_prox +
mrt_prox + park_prox + gd_prisch_prox + mall_prox +
supermarket_prox + kindergarten_num + childcare_num +
busstop_num + prisch_num, data = resale_sf)
weights: nb_lw

Moran I statistic standard deviate = 762.22, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Observed Moran I      Expectation         Variance 
    3.815577e-01    -4.496232e-04     2.511801e-07 

Results above show that:

9 Building Hedonic Pricing Models using GWmodel

In this section, we will build a hedonic pricing model using adaptive bandwidth. We will only use adaptive bandwidth here as using fixed bandwidth means that local regressions for small spatial units may be calibrated on a high number of dissimilar areas, while local regressions for huge areas may be calibrated on very few data points. This might lead to estimates with large standard errors. Hence, to mitigate this potential problem, we will use adaptive bandwidth here instead.

9.1 Building Adaptive Bandwidth GWR Model

In the adaptive bandwidth GWR model, the factors that are used to calculate the adaptive bandwidth and to calibrate the model are:

Note: Not all of the variables prepared above are used since it takes too long for the computation of adaptive bandwidth.

9.1.1 Compute adaptive bandwidth

bw.adaptive <- bw.gwr(formula = resale_price ~ remaining_lease_final +  cbd_prox + eldercare_prox  + mrt_prox  + gd_prisch_prox + supermarket_prox + busstop_num, data=resale_sp, approach="CV", kernel="gaussian", adaptive=TRUE, longlat=FALSE)
Take a cup of tea and have a break, it will take a few minutes.
          -----A kind suggestion from GWmodel development group
Adaptive bandwidth: 9805 CV score: 7.476334e+13 
Adaptive bandwidth: 6068 CV score: 6.882415e+13 
Adaptive bandwidth: 3757 CV score: 6.173472e+13 
Adaptive bandwidth: 2330 CV score: 5.492212e+13 
Adaptive bandwidth: 1446 CV score: 4.52309e+13 
Adaptive bandwidth: 902 CV score: 3.758032e+13 
Adaptive bandwidth: 563 CV score: 3.243806e+13 
Adaptive bandwidth: 356 CV score: 2.805769e+13 
Adaptive bandwidth: 225 CV score: 2.467957e+13 
Adaptive bandwidth: 147 CV score: 2.242044e+13 
Adaptive bandwidth: 96 CV score: 2.056219e+13 
Adaptive bandwidth: 67 CV score: 1.913854e+13 
Adaptive bandwidth: 46 CV score: 1.820342e+13 
Adaptive bandwidth: 36 CV score: NaN 
Adaptive bandwidth: 55 CV score: 1.852645e+13 
Adaptive bandwidth: 43 CV score: 1.833366e+13 
Adaptive bandwidth: 50 CV score: 1.837026e+13 
Adaptive bandwidth: 45 CV score: 1.816257e+13 
Adaptive bandwidth: 43 CV score: 1.833366e+13 
Adaptive bandwidth: 45 CV score: 1.816257e+13 

Results above show that:

9.1.2 Constructing the adaptive bandwidth gwr model

gwr.adaptive <- gwr.basic(formula = resale_price ~ remaining_lease_final +  cbd_prox + eldercare_prox  + mrt_prox  + gd_prisch_prox + supermarket_prox + busstop_num, data=resale_sp, bw=bw.adaptive, kernel = 'gaussian', adaptive=TRUE, longlat = FALSE)

gwr.adaptive
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2021-11-06 17:56:32 
   Call:
   gwr.basic(formula = resale_price ~ remaining_lease_final + cbd_prox + 
    eldercare_prox + mrt_prox + gd_prisch_prox + supermarket_prox + 
    busstop_num, data = resale_sp, bw = bw.adaptive, kernel = "gaussian", 
    adaptive = TRUE, longlat = FALSE)

   Dependent (y) variable:  resale_price
   Independent variables:  remaining_lease_final cbd_prox eldercare_prox mrt_prox gd_prisch_prox supermarket_prox busstop_num
   Number of data points: 15853
   ***********************************************************************
   *                    Results of Global Regression                     *
   ***********************************************************************

   Call:
    lm(formula = formula, data = data)

   Residuals:
    Min      1Q  Median      3Q     Max 
-234925  -49126   -4042   41534  546951 

   Coefficients:
                           Estimate Std. Error  t value Pr(>|t|)    
   (Intercept)            3.614e+05  4.432e+03   81.548  < 2e-16 ***
   remaining_lease_final  4.422e+03  4.673e+01   94.646  < 2e-16 ***
   cbd_prox              -2.204e+01  2.008e-01 -109.770  < 2e-16 ***
   eldercare_prox        -9.578e+00  9.069e-01  -10.561  < 2e-16 ***
   mrt_prox              -2.092e+01  1.509e+00  -13.865  < 2e-16 ***
   gd_prisch_prox         3.021e+00  2.567e-01   11.769  < 2e-16 ***
   supermarket_prox      -1.532e+01  3.760e+00   -4.074 4.64e-05 ***
   busstop_num            1.578e+03  2.021e+02    7.806 6.25e-15 ***

   ---Significance stars
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
   Residual standard error: 71400 on 15845 degrees of freedom
   Multiple R-squared: 0.6473
   Adjusted R-squared: 0.6471 
   F-statistic:  4154 on 7 and 15845 DF,  p-value: < 2.2e-16 
   ***Extra Diagnostic information
   Residual sum of squares: 8.077277e+13
   Sigma(hat): 71384.54
   AIC:  399345.9
   AICc:  399346
   BIC:  383649
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Adaptive bandwidth: 45 (number of nearest neighbours)
   Regression points: the same locations as observations are used.
   Distance metric: Euclidean distance metric is used.

   ****************Summary of GWR coefficient estimates:******************
                                Min.     1st Qu.      Median
   Intercept             -1.0292e+09 -6.2927e+05  2.7628e+05
   remaining_lease_final -2.9006e+04  3.2042e+03  4.8611e+03
   cbd_prox              -3.1551e+06 -1.2414e+02 -2.4163e+01
   eldercare_prox        -1.3512e+05 -4.2276e+01  1.5652e+01
   mrt_prox              -1.0449e+04 -1.1510e+02 -4.4726e+01
   gd_prisch_prox        -1.0891e+05 -1.2326e+02 -7.5310e+00
   supermarket_prox      -4.3780e+04 -6.0822e+01 -5.8996e+00
   busstop_num           -5.8612e+04 -2.4186e+03 -2.0376e+02
                             3rd Qu.       Max.
   Intercept              1.1670e+06 2.7264e+10
   remaining_lease_final  7.2330e+03 3.1653e+04
   cbd_prox               7.5793e+01 1.0092e+05
   eldercare_prox         7.2167e+01 1.2880e+06
   mrt_prox               3.4030e+01 2.3643e+05
   gd_prisch_prox         1.0899e+02 2.5232e+06
   supermarket_prox       5.9002e+01 6.3030e+03
   busstop_num            2.1914e+03 3.4619e+04
   ************************Diagnostic information*************************
   Number of data points: 15853 
   Effective number of parameters (2trace(S) - trace(S'S)): 1189.44 
   Effective degrees of freedom (n-2trace(S) + trace(S'S)): 14663.56 
   AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 375516.3 
   AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 374435.9 
   BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): 366867.9 
   Residual sum of squares: 1.581865e+13 
   R-square value:  0.9309263 
   Adjusted R-square value:  0.925323 

   ***********************************************************************
   Program stops at: 2021-11-06 17:58:46 

Results above show that:

10 Visualising GWR Output

In addition to regression residuals, the output feature class table includes fields such as:

They are all stored in a SpatialPointsDataFrame or SpatialPolygonsDataFrame object integrated with fit.points, GWR coefficient estimates, y value, predicted values, coefficient standard errors and t-values in its “data” slot in an object called SDF of the output list.

10.1 Convert SDF into sf data.frame

We have to first convert the SDF to a simple feature data frame before visualizing them.

resale_sf_adaptive <- st_as_sf(gwr.adaptive$SDF) %>%
  st_transform(crs=3414)

10.2 Summary statistics

summary(gwr.adaptive$SDF$yhat)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 230357  356045  405707  433660  465218  990368 

Results above show that:

10.3 Visualising local R2

To see where GWR predicts well and where it predicts poorly, we can map the Local R2 values. This may also provide clues about important variables that may be missing from the regression model

tmap_mode("view")
tm_shape(mpsz_sf)+
  tm_polygons(alpha = 0.1) +
tm_shape(resale_sf_adaptive) +
  tm_dots(col = "Local_R2",
          border.col = "gray60",
          border.lwd = 1) +
  tm_view(set.zoom.limits = c(11,14))
tmap_mode("plot")

Results above show that:

11 Conclusion

To reiterate, the factors used to calibrate the hedonic pricing model are:

These factors are able to explain majority of four-room HDB resale prices with transaction period from 1st January 2019 to 30th September 2020.

12 References

Correlation

Downtown Core

epsg.io

Geographically Weighted Regression

Geographically Weighted Regression Exercise

Gifted Education Programme

gps-coordinates.net

Michael Friendly

rowMins